Abstract

Motivated by the increasing need to understand the algorithmic foundations of distributed
large-scale graph computations, we study a number of fundamental graph problems in a
message-passing model for distributed computing where $k \geq 2$ machines jointly perform
computations on graphs with $n$ nodes (typically, $n \gg k$). The input graph is assumed to be
initially randomly partitioned among the $k$ machines, a common implementation in many
real-world systems. Communication is point-to-point, and the goal is to minimize the number of
communication rounds of the computation.

Our main result is an (almost) optimal distributed randomized algorithm for graph
connectivity. Our algorithm runs in $\tilde{O}(n/k^2)$ rounds (the $\tilde{O}$ notation hides a
$\mathrm{polylog}(n)$ factor and an additive $\mathrm{polylog}(n)$ term). This improves over the
best previously known bound of $\tilde{O}(n/k)$ [Klauck et al., SODA 2015], and is optimal (up to
a polylogarithmic factor) in view of an existing lower bound of $\tilde{\Omega}(n/k^2)$. Our
improved algorithm combines several techniques, including linear graph sketching, that prove
useful in the design of efficient distributed graph algorithms. Using the connectivity algorithm
as a building block, we then present fast randomized algorithms for computing minimum spanning
trees, (approximate) min-cuts, and for many graph verification problems. All these algorithms
take $\tilde{O}(n/k^2)$ rounds, and are optimal up to polylogarithmic factors. We also show an
almost matching lower bound of $\tilde{\Omega}(n/k^2)$ rounds for many graph verification
problems by leveraging lower bounds in random-partition communication complexity.
1 Introduction


The focus of this paper is on distributed computation on large-scale graphs, which is becoming
increasingly important with the rise of massive graphs such as the Web graph, social networks,
biological networks, and other graph-structured data, and the consequent need for fast algorithms
to process such graphs. Several large-scale graph processing systems such as Pregel [31] and
Giraph have recently been designed based on the message-passing distributed computing model.
We study a number of fundamental graph problems in a model which abstracts the essence of
these graph-processing systems, and present almost tight bounds on the time complexity needed
to solve these problems. In this model, introduced in [22] and explained in detail in Section 1.1,
the input graph is distributed across a group of $k \geq 2$ machines that are pairwise
interconnected via a communication network. The $k$ machines jointly perform computations on an
arbitrary $n$-vertex input graph, where typically $n \gg k$. The input graph is assumed to be
initially randomly partitioned among the $k$ machines (a common implementation in many real-world
graph processing systems). Communication is point-to-point via message passing. The computation
advances in synchronous rounds, and there is a constraint on the amount of data that can cross
each link of the network in each round. The goal is to minimize the time complexity, i.e., the
number of rounds required by the computation. This model is aimed at investigating the amount of
“speed-up” possible vis-a-vis the number of available machines, in the following sense: when $k$
machines are used, how does the time complexity scale in $k$? Which problems admit linear
scaling? Is it possible to achieve super-linear scaling?

Klauck et al. [22] present lower and upper bounds for several fundamental graph problems
in the $k$-machine model. In particular, assuming that each link has a bandwidth of one bit per
round, they show a lower bound of $\tilde{\Omega}(n/k^2)$ rounds for the graph connectivity
problem.¹ They also present an $\tilde{O}(n/k)$-round algorithm for graph connectivity and
spanning tree (ST) verification. This algorithm thus exhibits a scaling linear in the number of
machines $k$. The question of existence of a faster algorithm, and in particular of an algorithm
matching the $\tilde{\Omega}(n/k^2)$ lower bound, was left open in [22]. In this paper we answer
this question affirmatively by presenting an $\tilde{O}(n/k^2)$-round algorithm for graph
connectivity, thus achieving a speedup quadratic in $k$. This is optimal up to polylogarithmic
(in $n$) factors.

This result is important for two reasons. First, it shows that there are non-trivial graph
problems for which we can obtain superlinear (in $k$) speed-up. To elaborate further on this
point, we shall take a closer look at the proof of the lower bound for connectivity shown in
[22]. Using communication complexity techniques, that proof shows that any (possibly randomized)
algorithm for the graph connectivity problem has to exchange $\tilde{\Omega}(n)$ bits of
information across the $k$ machines, for any $k \geq 2$. Since there are $k(k-1)/2$ links in a
complete network with $k$ machines, when each link can carry $O(\mathrm{polylog}(n))$ bits per
round, in each single round the network can deliver at most $\tilde{O}(k^2)$ bits of information,
and thus a lower bound of $\tilde{\Omega}(n/k^2)$ rounds follows. The result of this paper thus
shows that it is possible to exploit in full the available bandwidth, thus achieving a speed-up
of $\tilde{\Theta}(k^2)$. Second, this implies that many other important graph problems can be
solved in $\tilde{O}(n/k^2)$ rounds as well. These include computing a spanning tree, minimum
spanning tree (MST), approximate min-cut, and many verification problems such as spanning
connected subgraph, cycle containment, and bipartiteness.
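
The counting argument behind this lower bound can be written out in one line. The display below is only a restatement of the quantities given in the paragraph above; the symbols $B$ (total bits deliverable per round) and $T$ (number of rounds) are names introduced here for illustration and do not appear in the original.

```latex
B \;=\; \frac{k(k-1)}{2} \cdot O(\mathrm{polylog}(n)) \;=\; \tilde{O}(k^2),
\qquad
T \;\geq\; \frac{\tilde{\Omega}(n)}{B} \;=\; \tilde{\Omega}\!\left(\frac{n}{k^2}\right).
```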

It is important to note that under a different output requirement (explained next) there exists a
$\tilde{\Omega}(n/k)$-round lower bound for computing a spanning tree of a graph [22], which also
implies the same lower bound for other fundamental problems such as computing an MST,
breadth-first tree, and shortest paths tree. However, this lower bound holds under the
requirement that each vertex (i.e., the machine which hosts the vertex) must know at the end of
the computation the “status” of all of its incident edges, that is, whether they belong to an ST
or not, and output their respective status. (This is the output criterion that is usually
required in distributed algorithms.) The proof of the lower bound exploits this criterion to show
that any algorithm requires some machine receiving $\tilde{\Omega}(n)$ bits of information, and
since any machine has $k - 1$ incident links, this results in a $\tilde{\Omega}(n/k)$ lower
bound. On the other hand, if we relax the output criterion to require the final status of each
edge to be known by some machine, then we show that this can be accomplished in
$\tilde{O}(n/k^2)$ rounds using the fast connectivity algorithm of this paper.

¹ Throughout this paper, $\tilde{O}(f(n))$ denotes $O(f(n)\,\mathrm{polylog}\,n + \mathrm{polylog}\,n)$,
and $\tilde{\Omega}(f(n))$ denotes $\Omega(f(n)/\mathrm{polylog}\,n)$.


1.1 The Model 


We now describe the adopted model of distributed computation, the $k$-machine model (a.k.a. the
Big Data model), introduced in [22] and further investigated in [9, 40, 38]. The model consists
of a set of $k \geq 2$ machines $N = \{M_1, M_2, \ldots, M_k\}$ that are pairwise interconnected
by bidirectional point-to-point communication links. Each machine executes an instance of a
distributed algorithm. The computation advances in synchronous rounds where, in each round,
machines can exchange messages over their communication links and perform some local
computation. Each link is assumed to have a bandwidth of $O(\mathrm{polylog}(n))$ bits per round,
i.e., $O(\mathrm{polylog}(n))$ bits can be transmitted over each link in each round. (As discussed
in [22] (cf. Theorem 4.1), it is easy to rewrite bounds to scale in terms of the actual
inter-machine bandwidth.) Machines do not share any memory and have no other means of
communication. There is an alternate (but equivalent) way to view this communication
restriction: instead of putting a bandwidth restriction on the links, we can put a restriction on
the amount of information that each machine can communicate (i.e., send/receive) in each round.
The results that we obtain in the bandwidth-restricted model will also apply to the latter model
[22]. Local computation within a machine is considered to happen instantaneously at zero cost,
while the exchange of messages between machines is the costly operation. (However, we note that
in all the algorithms of this paper, every machine in every round performs a computation bounded
by a polynomial in $n$.) We assume that each machine has access to a private source of true
random bits.

Although the $k$-machine model is a fairly general model of computation, we are mostly interested
in studying graph problems in it. Specifically, we are given an input graph $G$ with $n$
vertices, each associated with a unique integer ID from $[n]$, and $m$ edges. To avoid
trivialities, we will assume that $n \geq k$ (typically, $n \gg k$). Initially, the entire graph
$G$ is not known by any single machine, but rather partitioned among the $k$ machines in a
“balanced” fashion, i.e., the nodes and/or edges of $G$ are partitioned approximately evenly
among the machines. We assume a vertex-partition model, whereby vertices, along with information
of their incident edges, are partitioned across machines. Specifically, the type of partition
that we will assume throughout is the random vertex partition (RVP), that is, each vertex of the
input graph is assigned randomly to one machine. (This is the typical way used by many real
systems, such as Pregel [31], to partition the input graph among the machines; it is easy to
accomplish, e.g., via hashing.²) However, we note that our upper bounds also hold under the much
weaker assumption whereby it is only required that nodes and edges of the input graph are
partitioned approximately evenly among the machines; on the other hand, lower bounds under RVP
clearly apply to worst-case partitions as well.

² In Section 1.3 we will discuss an alternate partitioning model, the random edge partition (REP)
model, where each edge of $G$ is assigned independently and randomly to one of the $k$ machines,
and show how the results in the random vertex partition model can be related to the random edge
partition model.

More formally, in the random vertex partition variant, each vertex of $G$ is assigned
independently and uniformly at random to one of the $k$ machines. If a vertex $v$ is assigned to
machine $M_i$ we say that $M_i$ is the home machine of $v$ and, with a slight abuse of notation,
write $v \in M_i$. When a vertex is assigned to a machine, all its incident edges are assigned to
that machine as well; i.e., the home machine will know the IDs of the neighbors of that vertex as
well as the identity of the home machines of the neighboring vertices (and the weights of the
corresponding edges in case $G$ is weighted). Note that an immediate property of the RVP model is
that the number of vertices at each machine is balanced, i.e., each machine is the home machine
of $\tilde{\Theta}(n/k)$ vertices with high probability. A convenient way to implement the RVP
model is through hashing: each vertex (ID) is hashed to one of the $k$ machines. Hence, if a
machine knows a vertex ID, it also knows where it is hashed to.
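
As a minimal illustration of the hashing-based RVP described above, the following Python sketch maps each vertex ID to a machine and reports the resulting load. The function names `home_machine` and `partition`, the salt, and the choice of SHA-256 as a stand-in for a random assignment are all assumptions made for this example; the paper does not prescribe a particular hash function.

```python
import hashlib


def home_machine(vertex_id: int, k: int, salt: bytes = b"rvp") -> int:
    """Map a vertex ID to a machine index in [0, k) via a salted hash.

    Any machine that knows the vertex ID (and the shared salt) can
    recompute the home machine locally, without communication.
    """
    digest = hashlib.sha256(salt + vertex_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % k


def partition(n: int, k: int) -> dict[int, list[int]]:
    """Assign vertices 1..n to k machines; returns machine -> list of vertex IDs."""
    machines: dict[int, list[int]] = {i: [] for i in range(k)}
    for v in range(1, n + 1):
        machines[home_machine(v, k)].append(v)
    return machines


if __name__ == "__main__":
    n, k = 10_000, 10
    loads = [len(vs) for vs in partition(n, k).values()]
    # Loads should cluster around n / k; the exact values depend on the hash.
    print("min load:", min(loads), "max load:", max(loads))
```

The point of the design is the last sentence of the paragraph above: because the assignment is a deterministic function of the ID, looking up a neighbor's home machine costs no communication.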

Eventually, each machine $M_i$, for each $1 \leq i \leq k$, must set a designated local output
variable $o_i$ (which need not depend on the set of vertices assigned to $M_i$), and the output
configuration $o = (o_1, \ldots, o_k)$ must satisfy certain feasibility conditions for the
problem at hand. For example, for the minimum spanning tree problem each $o_i$ corresponds to a
set of edges, and the edges in the union of such sets must form an MST of the input graph.

In this paper, we show results for distributed algorithms that are Monte Carlo. Recall that
a Monte Carlo algorithm is a randomized algorithm whose output may be incorrect with some
probability. Formally, we say that an algorithm computes a function $f$ with $\epsilon$-error if
for every input it outputs the correct answer with probability at least $1 - \epsilon$, where the
probability is over the random partition and the random bit strings used by the algorithm (if
any). The round (time) complexity of an algorithm is the maximum number of communication rounds
until termination. For any $n$ and problem $\mathcal{P}$ on $n$-node graphs, we let the time
complexity of solving $\mathcal{P}$ with $\epsilon$ error probability in the $k$-machine model,
denoted by $T_\epsilon(\mathcal{P})$, be the minimum $T(n)$ such that there exists an
$\epsilon$-error protocol that solves $\mathcal{P}$ and terminates in $T(n)$ rounds. For any
$0 \leq \epsilon \leq 1$, graph problem $\mathcal{P}$ and function
$T : \mathbb{Z}^+ \to \mathbb{Z}^+$, we say that $T_\epsilon(\mathcal{P}) = O(T(n))$ if there
exist an integer $n_0$ and a real $c$ such that for all $n > n_0$,
$T_\epsilon(\mathcal{P}) \leq cT(n)$. Similarly, we say that
$T_\epsilon(\mathcal{P}) = \Omega(T(n))$ if there exist an integer $n_0$ and a real $c$ such that
for all $n > n_0$, $T_\epsilon(\mathcal{P}) \geq cT(n)$. For our upper bounds, we will usually
use $\epsilon = 1/n$, which will imply high probability algorithms, i.e., succeeding with
probability at least $1 - 1/n$. In this case, we will sometimes just omit $\epsilon$ and simply
say the time bound applies “with high probability.”

1.2 Our Contributions and Techniques 

The main result of this paper, presented in Section 2, is a randomized Monte Carlo algorithm in
the $k$-machine model that determines the connected components of an undirected graph $G$
correctly with high probability and that terminates in $\tilde{O}(n/k^2)$ rounds.³ This improves
upon the previous best bound of $\tilde{O}(n/k)$ [22], since it is strictly superior in the wide
parameter range $k = \Theta(n^\epsilon)$, for all constants $\epsilon \in (0, 1)$. Improving over
this bound is non-trivial since various attempts to get a faster connectivity algorithm fail due
to the fact that they end up congesting a particular machine too much, i.e., up to $n$ bits may
need to be sent/received by a machine, leading to a $\tilde{O}(n/k)$ bound (as a machine has only
$k - 1$ links). For example, a simple algorithm for connectivity is flooding: each vertex floods
the lowest labeled vertex that it has seen so far; at the end each vertex will have the label of
the lowest labeled vertex in its component.⁴ It can be shown that the above algorithm takes
$\Theta(n/k + D)$ rounds (where $D$ is the graph diameter) in the $k$-machine model by using the
Conversion Theorem of [22]. Hence new techniques are needed to break the $n/k$-round barrier.

³ Since the focus is on the scaling of the time complexity with respect to $k$, we omit
explicitly stating the polylogarithmic factors in our run time bounds. However, the hidden
polylogarithmic factor is not large: at most $O(\log^3 n)$.
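
The flooding baseline mentioned above can be sketched in a few lines of Python. This is a sequential simulation of the graph-level process only, under the assumption of synchronous rounds in which every vertex sees its neighbors' current labels; it does not model the $k$ machines, the link bandwidth, or the resulting congestion, and the name `flood_min_label` is introduced here for illustration.

```python
def flood_min_label(n, edges):
    """Simulate synchronous min-label flooding on vertices 1..n.

    In every round each vertex adopts the smallest label among its own
    label and its neighbors' labels from the previous round.
    Returns (labels, rounds), where rounds counts the rounds in which
    at least one label changed.
    """
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = {v: v for v in adj}  # initially every vertex holds its own ID
    rounds = 0
    while True:
        new_label = {
            v: min([label[v]] + [label[u] for u in adj[v]]) for v in adj
        }
        if new_label == label:  # nothing changed: flooding has converged
            return label, rounds
        label = new_label
        rounds += 1


if __name__ == "__main__":
    # A path 1-2-3-4 and a separate edge 5-6.
    labels, rounds = flood_min_label(6, [(1, 2), (2, 3), (3, 4), (5, 6)])
    print(labels, rounds)
```

On the path component the smallest label needs one round per hop to reach the far end, which is where the dependence on the diameter $D$ comes from.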

Our connectivity algorithm is the result of the application of the following three techniques. 

1. Randomized Proxy Computation. This technique, similar to known techniques used in
randomized routing algorithms [44], is used to load-balance congestion at any given machine by
redistributing it evenly across the $k$ machines. This is achieved, roughly speaking, by
re-assigning the executions of individual nodes uniformly at random among the machines. It is
crucial to distribute the computation and communication across machines to avoid congestion at
any particular machine. In fact, this allows one to move away from the communication pattern
imposed by the topology of the input graph (which can cause congestion at a particular machine)
to a more balanced communication.

2. Distributed Random Ranking (DRR). DRR [5] is a simple technique that will be used to build
trees of low height in the connectivity algorithm. Our connectivity algorithm is divided into
phases, in each of which we do the following: each current component (in the first phase, each
vertex is a component by itself) chooses one outgoing edge and then components are combined by
merging them along outgoing edges. If done naively, this may result in a long chain of merges,
resulting in a component tree of high diameter; communication along this tree will then take a
long time. To avoid this we resort to DRR, which suitably reduces the number of merges. With DRR,
each component chooses a random rank, which is simply a random number, say in the interval
$[1, n^3]$; a component $C_i$ then merges with the component $C_j$ on the other side of its
selected outgoing edge if and only if the rank of $C_j$ is larger than the rank of $C_i$.
Otherwise, $C_i$ does not merge with $C_j$, and thus it becomes the root of a DRR tree, which is
a tree induced by the components and the set of the outgoing edges that have been used in the
above merging procedure. It can be shown that the height of a DRR tree is bounded by
$O(\log n)$ with high probability.

3. Linear Graph Sketching. Linear graph sketching [32] is crucially helpful in efficiently
finding an outgoing edge of a component. A sketch for a vertex (or a component) is a short
($O(\mathrm{polylog}\,n)$-bit) vector that efficiently encodes the adjacency list of the vertex.
Sampling from this sketch gives a random (outgoing) edge of this vertex (component). A very
useful property is the linearity of the sketches: adding the sketches of a set of vertices gives
the sketch of the component obtained by combining the vertices; the edges between the vertices
(i.e., the intra-component edges) are automatically “cancelled”, leaving only a sketch of the
outgoing edges. Linear graph sketches were originally used to process dynamic graphs in the
(semi-) streaming model. Here, in a distributed setting, we use them to reduce the amount of
communication needed to find an outgoing edge; in particular, graph sketches spare us from
checking whether an edge is an inter-component or an intra-component edge, and this will
crucially reduce communication across machines. We note that earlier distributed algorithms such
as the classical GHS algorithm [13] for the MST problem would incur too much communication since
they involve checking the status of each edge of the graph.
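
The cancellation property behind linear sketches can be seen on the uncompressed vectors that a sketch summarizes. The Python sketch below uses signed incidence vectors indexed by vertex pairs, a common way to set this up; it is an assumption of this example, not a description of the paper's exact construction. An actual sketch would apply a random linear map to these vectors to shrink them to polylogarithmic size, which preserves the additivity shown here. The function names are hypothetical, and the input is assumed to be a simple graph (no self-loops or parallel edges).

```python
from itertools import combinations


def pair_index(n):
    """Index every unordered vertex pair (a, b), a < b, over vertices 1..n."""
    return {pair: i for i, pair in enumerate(combinations(range(1, n + 1), 2))}


def incidence_vector(v, index, edges):
    """Signed incidence vector of vertex v.

    For each edge {a, b} with a < b, the entry at position (a, b) is
    +1 if v == a, -1 if v == b, and 0 if v is not an endpoint.
    """
    vec = [0] * len(index)
    for a, b in edges:
        a, b = min(a, b), max(a, b)
        if v == a:
            vec[index[(a, b)]] += 1
        elif v == b:
            vec[index[(a, b)]] -= 1
    return vec


def outgoing_edges(component, n, edges):
    """Sum the vectors of all vertices in the component.

    Intra-component edges contribute +1 and -1 at the same position and
    cancel; the non-zero positions are exactly the outgoing edges.
    """
    index = pair_index(n)
    total = [0] * len(index)
    for v in component:
        vec = incidence_vector(v, index, edges)
        total = [x + y for x, y in zip(total, vec)]
    return sorted(pair for pair, i in index.items() if total[i] != 0)


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
    print(outgoing_edges({1, 2, 3}, 5, edges))  # only the edge leaving the triangle
```

No vertex ever needs to ask whether a given edge is internal to its component: the sum alone removes internal edges, which is the communication saving the paragraph above refers to.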

We observe that it does not seem straightforward to effectively exploit these techniques in the
$k$-machine model: for example, linear sketches can be easily applied in the distributed
streaming model by sending to a coordinator machine the sketches of the partial stream, which
then will be added to obtain the sketch of the entire stream. Mimicking this trivial strategy in
the $k$-machine model would cause too much congestion at one node, leading to a $\tilde{O}(n/k)$
time bound.

⁴ This algorithm has been implemented in a variant of Giraph [43].

Using the above techniques and the fast connectivity algorithm, in Section 3 we give algorithms
for many other important graph problems. In particular, we present a $\tilde{O}(n/k^2)$-round
algorithm for computing an MST (and hence an ST). We also present $\tilde{O}(n/k^2)$-round
algorithms for approximate min-cut, and for many graph verification problems including spanning
connected subgraph, cycle containment, and bipartiteness. All these algorithms are optimal up to
a polylogarithmic factor.

In Section 4 we show a lower bound of $\tilde{\Omega}(n/k^2)$ rounds for many verification
problems by simulating the $k$-machine model in a 2-party model of communication complexity where
the inputs are randomly assigned to the players.

1.3 Related Work 

The theoretical study of large-scale graph computations in distributed systems is relatively new. 
Several works have been devoted to developing MapReduce graph algorithms (see, e.g., [20, 25, 28]
and references therein). We note that the flavor of the theory developed for MapReduce is quite
different compared to the one for the $k$-machine model. Minimizing communication is also the key
goal in MapReduce algorithms; however this is usually achieved by making sure that the data is
made small enough quickly (that is, in a small number of MapReduce rounds) to fit into the memory
of a single machine (see, e.g., the MapReduce algorithm for MST in [25]).

For a comparison of the $k$-machine model with other models for parallel and distributed
processing, including the Bulk-Synchronous Parallel (BSP) model [15], MapReduce [20], and the
congested clique, we refer to [46]. In particular, according to [46], “Among all models with
restricted communication the ‘big data’ [$k$-machine] model is the one most similar to the
MapReduce model”.

The $k$-machine model is closely related to the BSP model; it can be considered to be a
simplified version of BSP, where the costs of local computation and of synchronization (which
happens at the end of every round) are ignored. Unlike the BSP and refinements thereof, which
have several different parameters that make the analysis of algorithms complicated [46], the
$k$-machine model is characterized by just one parameter, the number of machines; this makes the
model simple enough to be analytically tractable, thus easing the job of designing and analyzing
algorithms, while at the same time it still captures the key features of large-scale distributed
computations.

The $k$-machine model is related to the classical CONGEST model [39], and in particular to
the congested clique model, which recently has received considerable attention (see, e.g.,
[16]). The main difference is that the $k$-machine model is aimed at the study of large-scale
computations, where the size $n$ of the input is significantly bigger than the number of
available machines $k$, and thus many vertices of the input graph are mapped to the same machine,
whereas the two aforementioned models are aimed at the study of distributed network algorithms,
where $n = k$ and thus each vertex corresponds to a dedicated machine. More “local knowledge”
is available per vertex (since it can access for free information about other vertices in the
same machine) in the $k$-machine model compared to the other two models. On the other hand, all
vertices assigned to a machine have to communicate through the links incident on this machine,
which can limit the bandwidth (unlike the other two models where each vertex has a dedicated
processor). These differences manifest in the time complexity. In particular, the fastest known
distributed algorithm in the congested clique model for a given problem may not give rise to the
fastest algorithm in the $k$-machine model. For example, the fastest algorithms for MST in the
congested clique model ([29, 16]) require $\Theta(n^2)$ messages; implementing these algorithms
in the $k$-machine model requires $\Theta(n^2/k^2)$ rounds. Conversely, the slower GHS algorithm
[13] gives an $\tilde{O}(n/k)$ bound in the $k$-machine model. The recently developed techniques
used to prove time lower bounds in the standard CONGEST model and in the congested clique model
are not directly applicable here.

The work closest in spirit to ours is the recent work of Woodruff and Zhang. This paper
considers a number of basic statistical and graph problems in a distributed message-passing model
similar to the $k$-machine model. However, there are some important differences. First, their
model is asynchronous, and the cost function is the communication complexity, which refers to the
total number of bits exchanged by the machines during the computation. Second, a worst-case
distribution of the input is assumed, while we assume a random distribution. Third, and
importantly, they assume an edge partition model for the problems on graphs, that is, the edges
of the graph (as opposed to its vertices) are partitioned across the $k$ machines. In particular,
for the connectivity problem, they show a message complexity lower bound of
$\tilde{\Omega}(nk)$, which essentially translates to a $\tilde{\Omega}(n/k)$ round lower bound
in the $k$-machine model; it can be shown by using their proof technique that this lower bound
also applies to the random edge partition (REP) model, where edges are partitioned randomly among
machines. On the other hand, it is easy to show an $\tilde{O}(n/k)$ upper bound for connectivity
and MST in the REP model.⁵ Hence, in the REP model, $\tilde{\Theta}(n/k)$ is a tight bound for
connectivity and other related problems such as MST. In contrast, in the RVP model (arguably, a
more natural partition model), we show that $\tilde{\Theta}(n/k^2)$ is the tight bound. Our
results are a step towards a better understanding of the complexity of distributed graph
computation vis-a-vis the partition model.

From the technical point of view, King et al. [21] also use an idea similar to linear sketching.
Their technique might also be useful in the context of the $k$-machine model.

2 The Connectivity Algorithm 

In this section we present our main result, a Monte Carlo randomized algorithm for the
$k$-machine model that determines the connected components of an undirected graph $G$ correctly
with high probability and that terminates in $\tilde{O}(n/k^2)$ rounds with high probability.
This algorithm is optimal, up to $\mathrm{polylog}(n)$ factors, by virtue of a lower bound of
$\tilde{\Omega}(n/k^2)$ rounds [22].

Before delving into the details of our algorithm, as a warm-up we briefly discuss simpler, but
less efficient, approaches. The easiest way to solve any problem in our model is to first collect
all available graph data at a single machine and then solve the problem locally. For example, one
could first elect a referee among the machines, which requires $O(1)$ rounds [53], and then
instruct every machine to send its local data to the referee machine. Since the referee machine
needs to receive $O(m)$ information in total but has only $k - 1$ links of bounded bandwidth,
this requires $\tilde{\Omega}(m/k)$ rounds.

A more refined approach to obtain a distributed algorithm for the $k$-machine model is to use
the Conversion Theorem of [22], which provides a simulation of a congested clique algorithm
$\mathcal{A}$ in $\tilde{O}(M/k^2 + \Delta' T/k)$ rounds in the $k$-machine model, where $M$ is
the message complexity of $\mathcal{A}$, $T$ is its round complexity, and $\Delta'$ is an upper
bound to the total number of messages sent (or received) by a single node in a single round. (All
these parameters refer to the performance of $\mathcal{A}$ in the congested clique model.)
Unfortunately, existing algorithms typically require $\Delta'$ to scale to the maximum node
degree, and thus the converted time complexity bound in the $k$-machine model becomes
$\tilde{\Omega}(n/k)$ at best. Therefore, in order to break the $\tilde{\Omega}(n/k)$ barrier, we
must develop new techniques that directly exploit the additional locality available in the
$k$-machine model.

⁵ The high-level idea of the MST algorithm in the REP model is: (1) First “filter” the edges
assigned to one machine using the cut and cycle properties of an MST; this leaves each machine
with $O(n)$ edges; (2) Convert this edge distribution to an RVP, which can be accomplished in
$\tilde{O}(n/k)$ rounds via hashing the vertices randomly to machines and then routing the edges
appropriately; then apply the RVP bound.

In the next subsection we give a high level overview of our algorithm, and then formally present 
all the technical details in the subsequent subsections. 


2.1 Overview of the Algorithm 


Our algorithm follows a Borůvka-style strategy [6], that is, it repeatedly merges adjacent
components of the input graph $G$, which are connected subgraphs of $G$, to form larger
(connected) components. The output of each of these phases is a labeling of the nodes of $G$ such
that nodes that belong to the same current component have the same label. At the beginning of the
first phase, each node is labeled with its own unique ID, forms a distinct component, and is also
the component proxy of its own component. Note that, at any phase, a component contains up to $n$
nodes, which might be spread across different machines; we use the term component part to refer
to all those nodes of the component that are held by the same machine. Hence, at any phase every
component is partitioned into at most $k$ component parts. At the end of the algorithm each
vertex has a label such that any two vertices have the same label if and only if they belong to
the same connected component of $G$.

Our algorithm relies on linear graph sketches as a tool to enable communication-efficient merging
of multiple components. Intuitively speaking, a (random) linear sketch $s_u$ of a node $u$'s
graph neighborhood returns a sample chosen uniformly at random from $u$'s incident edges.
Interestingly, such a linear sketch can be represented as matrices using only
$O(\mathrm{polylog}(n))$ bits [17, 32]. A crucial property of these sketches is that they are
linear: that is, given sketches $s_u$ and $s_v$, the combined sketch $s_u + s_v$ (“+” refers to
matrix addition) has the property that, w.h.p., it yields a random sample of the edges incident
to $(u, v)$ in a graph where we have contracted the edge $(u, v)$ to a single node. We describe
the technical details in Section 2.3.

We now describe how to communicate these graph sketches in an efficient manner: Consider a
component $C$ that is split into $j$ parts $P_1, P_2, \ldots, P_j$, the nodes of which are hosted
at machines $M_1, M_2, \ldots, M_j$. To find an outgoing edge for $C$, we first instruct each
machine $M_i$ to construct a linear sketch of the graph neighborhood of each of the nodes in part
$P_i$. Then, we sum up these $|P_i|$ sketches, yielding a sketch $s_{P_i}$ for the neighborhood
of part $P_i$. To combine the sketches of the $j$ distinct parts, we now select a random
component proxy machine $M_{C,r}$ for the current component $C$ at round $r$ (see Section 2.2).
Next, machine $M_i$ sends $s_{P_i}$ to machine $M_{C,r}$; note that this causes at most $k$
messages to be sent to the component proxy. Finally, machine $M_{C,r}$ computes
$s_C = \sum_{i=1}^{j} s_{P_i}$ and then uses $s_C$ to sample an edge incident to some node in
$C$, which, by construction, is guaranteed to have its endpoint in a distinct component $C'$.
(See Section 2.4.)
At this point, each component proxy has sampled an inter-component edge inducing the edges
of a component graph $\mathcal{C}$ where each vertex corresponds to a component. To enable the
efficient merging of components, we employ the distributed random ranking (DRR) technique of [8]
to break up any long paths of $\mathcal{C}$ into more manageable directed trees of depth
$O(\log n)$. To this end, every component chooses a rank independently and uniformly at random
from $[0, 1]$, and each component (virtually) connects to its neighboring component (according to
$\mathcal{C}$) via a (conceptual) directed edge if and only if the latter has a higher rank.
Thus, this process results in a collection of disjoint rooted trees, rooted at the node of
highest (local) rank. We show in Section 2.5 that each of such trees has depth $O(\log n)$.
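
The ranking rule just described can be sketched as follows. This Python example assumes that every component has already selected exactly one neighboring component (the one across its sampled outgoing edge); the names `drr_forest` and `depth`, and the use of a seeded pseudo-random generator in place of true random bits, are assumptions of the example. It illustrates only the structure of the resulting forest, not the depth analysis.

```python
import random


def drr_forest(chosen_neighbor, seed=0):
    """Apply the random-ranking rule to a component graph.

    chosen_neighbor maps each component to the component on the other
    side of its selected outgoing edge (every value must also be a key).
    Returns (rank, parent): parent[c] is the chosen neighbor if that
    neighbor has a strictly higher rank, and None otherwise, in which
    case c is the root of its tree.
    """
    rng = random.Random(seed)
    rank = {c: rng.random() for c in chosen_neighbor}  # ranks in [0, 1)
    parent = {
        c: (nbr if rank[nbr] > rank[c] else None)
        for c, nbr in chosen_neighbor.items()
    }
    return rank, parent


def depth(c, parent):
    """Number of parent pointers followed from c until a root is reached."""
    d = 0
    while parent[c] is not None:
        c = parent[c]
        d += 1
    return d


if __name__ == "__main__":
    # A long chain of components: i points to i + 1, and the last one points back.
    chosen = {i: i + 1 for i in range(1, 1000)}
    chosen[1000] = 999
    rank, parent = drr_forest(chosen)
    print("max tree depth:", max(depth(c, parent) for c in chosen))
```

Because ranks strictly increase along parent pointers, the pointers can never form a cycle, and a long chain is cut wherever the rank fails to increase.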

The merging of the components of each tree $T$ proceeds from the leaves upward (in parallel for
each tree). In the first merging phase, each leaf $C_j$ of $T$ merges with its parent $C'$ by
relabeling the component labels of all of its nodes with the label of $C'$. Note that the proxy
$M_{C_j}$ knows the labeling of $C'$, as it has computed the outgoing edge from a vertex in $C_j$
to a vertex in $C'$. Therefore, machine $M_{C_j}$ sends the label of $C'$ to all the machines
that hold a part of $C_j$. In Section 2.5 we show that this can be done in parallel (for all
leaves of all trees) in $\tilde{O}(n/k^2)$ rounds. Repeating this merging procedure $O(\log n)$
times guarantees that each tree has been merged into a single component.

Finally, in Section 2.6 we prove that $O(\log n)$ repetitions of the above process suffice to
ensure that the components at the end of the last phase correspond to the connected components of
the input graph $G$.


2.2 Communication via Random Proxy Machines 

Recall that our algorithm iteratively groups vertices into components and subsequently merges 
such components according to the topology of G. Each of these components may be split into 
multiple component parts spanning multiple machines. Hence, to ensure efficient load balancing 
of the messages that machines need to send on behalf of the component parts that they hold, the 
algorithm performs all communication via proxy machines. 

Our algorithm proceeds in phases, and each phase consists of iterations. Consider the $p$-th iteration
of the $j$-th phase of the algorithm, with $p, j \ge 1$. We construct a "sufficiently" random hash function
$h_{j,p}$ such that, for each component $C$, the machine with ID $h_{j,p}(C) \in [k]$ is selected as the proxy
machine for component $C$. First, machine $M_1$ generates $\ell = \tilde{\Theta}(n/k)$ random bits from its private
source of randomness. $M_1$ then distributes these random bits to all other machines via the following
simple routing mechanism, which proceeds in sequences of two rounds. $M_1$ selects $k-1$ bits
$b_1, b_2, \ldots, b_{k-1}$ from the set of its $\ell$ private random bits that remain to be distributed, and sends
bit $b_i$ across its $i$-th link to machine $M_{i+1}$. Upon receiving $b_i$, machine $M_{i+1}$ broadcasts $b_i$ to all
machines in the next round. This ensures that bits $b_1, b_2, \ldots, b_{k-1}$ become common knowledge within
two rounds. Repeating this process to distribute all the $\ell = \tilde{\Theta}(n/k)$ bits takes $\tilde{O}(n/k^2)$ rounds,
after which all the machines have the $\ell$ random bits generated by $M_1$. We leverage a result of [4] (cf. its
formulation as Theorem 2.1 in [5]), which tells us that we can generate a random hash function
that is $d$-wise independent by using only $O(d \log n)$ true random bits. We instruct machine
$M_1$ to disseminate $d = \ell \log n = n\,\mathrm{polylog}(n)/k$ of its random bits according to the above routing
process, and then each machine locally constructs the same hash function $h_{j,p}$, which is then used to
determine the component proxies throughout iteration $p$ of phase $j$.
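The following Python sketch illustrates how every machine can derive the same proxy assignment from the same shared bits. It uses the textbook construction of a $d$-wise independent family (a random polynomial of degree $d-1$ over a prime field), which is an assumption made here for illustration and not necessarily the construction of [4] or [5]. The names `make_proxy_hash`, `shared_seed`, and `MERSENNE_61` are hypothetical, and the seeded generator merely stands in for the bits disseminated by $M_1$.

```python
import random

MERSENNE_61 = (1 << 61) - 1  # a prime larger than any component label used here

def make_proxy_hash(d, k, shared_seed):
    """Build h: label -> machine ID in {1, ..., k} from a random polynomial of
    degree d - 1 over the integers modulo MERSENNE_61 (illustrative assumption)."""
    rng = random.Random(shared_seed)          # stands in for the bits sent by M_1
    coeffs = [rng.randrange(MERSENNE_61) for _ in range(d)]

    def h(label):
        acc = 0
        for c in reversed(coeffs):            # Horner evaluation of the polynomial
            acc = (acc * label + c) % MERSENNE_61
        return acc % k + 1                    # the final reduction adds a tiny bias

    return h

# Two machines that received the same bits build the same function locally.
h_on_machine_a = make_proxy_hash(d=8, k=16, shared_seed=2024)
h_on_machine_b = make_proxy_hash(d=8, k=16, shared_seed=2024)
proxies = [h_on_machine_a(label) for label in range(1, 101)]
```

Because the function is determined entirely by the shared bits, no further communication is needed to agree on the proxy of a component.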

We now show that communication via such proxy machines is fast in the $k$-machine model.

Lemma 1. Suppose that each machine $M$ generates a message of size $O(\mathrm{polylog}(n))$ bits for each
component part residing on $M$; let $m_i$ denote the message of part $P_i$ and let $C$ be the component
of which $P_i$ is a part. If each $m_i$ is addressed to the proxy machine $M_C$ of component $C$, then all
messages are delivered within $\tilde{O}(n/k^2)$ rounds with high probability.

6 It is easy to see that an accuracy of $O(\log n)$ bits suffices to break ties w.h.p.

Proof. Observe that, except for the very first phase of the algorithm, the claim does not immediately
follow from a standard balls-into-bins argument because not all the destinations of the messages are
chosen independently and uniformly at random, as any two distinct messages of the same component
have the same destination.

Let us stipulate that any component part held by machine $M_i$ is the $i$-th component part of its
component, and denote this part by $P_{i,j}$, $i \in [k]$, $j \in [n]$, where $P_{i,j} = \emptyset$ means that machine $i$
holds no component part of component $j$. Suppose that the algorithm is in phase $j'$ and iteration
$p$. By construction, the hash function $h_{j',p}$ is $\tilde{\Theta}(n/k)$-wise independent, and all the component parts
held by a single machine are parts of different components. Since $M_i$ has at most $\tilde{\Theta}(n/k)$ distinct
component parts w.h.p., it follows that all the proxy machines selected by the component parts
held by machine $M_i$ are distributed independently and uniformly at random. Let $y$ be the number
of distinct component parts held by a machine $M_i$, that is, $y = |\{P_{i,j} : P_{i,j} \ne \emptyset\}| = \tilde{O}(n/k)$ (w.h.p.).

Consider a link of $M_i$ connecting it to another machine $M_1$. Let $X_t$ be the indicator variable
that takes value 1 if $M_1$ is the component proxy of part $t$ (of $M_i$), and let $X_t = 0$ otherwise. Let
$X = \sum_{t=1}^{y} X_t$ be the number of component parts that chose their proxy machine at the endpoint of
link $(M_i, M_1)$. Since $\Pr(X_t = 1) = 1/(k-1)$, we have that the expected number of messages that
have to be sent by this machine over any specific link is $E[X] = y/(k-1)$.

First, consider the case $y \ge 11 k \log n$. As the $X_t$'s are $\tilde{\Theta}(n/k)$-wise independent, all proxies of
the component parts of $M_i$ are chosen independently and thus we can apply a standard Chernoff
bound (see, e.g., [33]), which gives

$$\Pr\left(X \ge \frac{7y}{4(k-1)}\right) \le e^{-\frac{3y}{16(k-1)}} < \frac{1}{n^2}.$$

By applying the union bound over the $k \le n$ machines we conclude that w.h.p. every machine sends
$\tilde{O}(n/k^2)$ messages to each proxy machine, and this requires $\tilde{O}(n/k^2)$ rounds.

Consider now the case $y < 11 k \log n$. It holds that $6\,E[X] = 6y/(k-1) < 6 \cdot 11 k \log n/(k-1) \le
132 \log n$, and thus, by a standard Chernoff bound,

$$\Pr(X \ge 132 \log n) \le 2^{-132 \log n} = \frac{1}{n^{132}}.$$

Analogously to the first case, applying the union bound over the $k \le n$ machines yields the result. □

2.3 Linear Graph Sketches 

As we will see in Section 2.5, our algorithm proceeds by merging components across randomly chosen
inter-component edges. In this subsection we show how to provide these sampling capabilities in a
communication-efficient way in the $k$-machine model by implementing random linear graph sketches.
Our description follows the notation of [52].

Recall that each vertex $u$ of $G$ is associated with a unique integer ID from $[n]$ (known to its
home machine) which, for simplicity, we also denote by $u$ (see footnote 7). For each vertex $u$ we define
the incidence vector $a_u \in \{-1, 0, 1\}^{\binom{n}{2}}$ of $u$, which describes the incident edges of $u$, as follows:

$$a_u[(x,y)] := \begin{cases} 1 & \text{if } u = x < y \text{ and } (x,y) \in E, \\ -1 & \text{if } x < y = u \text{ and } (x,y) \in E, \\ 0 & \text{otherwise.} \end{cases}$$

7 Note that the asymptotics of our results do not change if the size of the ID space is 0(polylog(n)). 
Note that the vector $a_u + a_v$ corresponds to the incidence vector of the contracted edge $(u,v)$.
Intuitively speaking, summing up incidence vectors "zeroes out" edges between the corresponding
vertices, hence the vector $\sum_{u \in C} a_u$ represents the outgoing edges of a component $C$.
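As a small illustration of this cancellation, the following Python sketch builds sparse incidence vectors for a toy graph and sums them over a vertex set. The graph and the helper names `incidence_vector` and `add_vectors` are hypothetical and chosen only for this example.

```python
def incidence_vector(u, edges):
    """Sparse a_u: maps each pair (x, y) with x < y to +1 or -1 as defined above."""
    vec = {}
    for x, y in edges:
        x, y = min(x, y), max(x, y)
        if u == x:
            vec[(x, y)] = 1
        elif u == y:
            vec[(x, y)] = -1
    return vec

def add_vectors(a, b):
    """Coordinate-wise sum of two sparse vectors, dropping zero entries."""
    out = dict(a)
    for key, val in b.items():
        total = out.get(key, 0) + val
        if total == 0:
            out.pop(key, None)
        else:
            out[key] = total
    return out

# Toy graph: a triangle on {1, 2, 3} plus the edge (3, 4) leaving the triangle.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
component = [1, 2, 3]
total = {}
for u in component:
    total = add_vectors(total, incidence_vector(u, edges))
```

The three triangle edges cancel, so `total` contains only the entry for the single outgoing edge $(3, 4)$.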

Since each incidence vector $a_u$ requires polynomial space, it would be inefficient to directly
communicate vectors to component proxies. Instead, we construct a random linear sketch $s_u$ of
$\mathrm{polylog}(n)$ size that allows us to sample uniformly at random a nonzero entry
of $a_u$ (i.e., an edge incident to $u$). (This is referred to as $\ell_0$-sampling in the streaming literature; see,
e.g., [55].) It is shown in m that $\ell_0$-sampling can be performed by linear projections. Therefore,
at the beginning of each phase $j$ of our algorithm, we instruct each machine to create a new
(common) $\mathrm{polylog}(n) \times \binom{n}{2}$ sketch matrix $L_j$, which we call the phase $j$ sketch matrix (see
footnote 8). Then, each machine $M$ creates a sketch $s_u = L_j \cdot a_u$ for each vertex $u$ that resides on $M$.
Hence, each $s_u$ can be represented by a polylogarithmic number of bits.

Observe that, by linearity, we have $L_j \cdot a_u + L_j \cdot a_v = L_j \cdot (a_u + a_v)$. In other words, a crucial
property of sketches is that the sum $s_u + s_v$ is itself a sketch that allows us to sample an edge
incident to the contracted edge $(u,v)$. We summarize these properties in the following statement.

Lemma 2. Consider a phase $j$, and let $P$ be a subgraph of $G$ induced by vertices $\{u_1, \ldots, u_\ell\}$. Let
$s_{u_1}, \ldots, s_{u_\ell}$ be the associated sketches of vertices in $P$ constructed by applying the phase $j$ sketch
matrix to the respective incidence vectors. Then, the combined sketch $s_P = \sum_{i=1}^{\ell} s_{u_i}$ can be
represented using $O(\mathrm{polylog}(n))$ bits and, by querying $s_P$, it is possible (w.h.p.) to sample a random
edge incident to $P$ (in $G$) that has its other endpoint in $G \setminus P$.
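To make the linearity property concrete, the following Python sketch implements a deliberately simplified linear sketch. It is a toy stand-in, not the $\ell_0$-sampler used by the algorithm: each coordinate is subsampled at geometrically decreasing rates, and every level keeps three linear counters that allow recovery when exactly one nonzero coordinate survives. The returned coordinate is therefore not guaranteed to be uniformly distributed. All names (`make_params`, `sketch`, `add`, `sample`) and the toy graph are assumptions made for illustration.

```python
import random
from itertools import combinations

P = (1 << 61) - 1  # prime modulus for the fingerprint counter

def make_params(num_coords, seed, levels=16):
    """Shared randomness: a fingerprint base r and a subsampling level per coordinate."""
    rng = random.Random(seed)
    r = rng.randrange(2, P)
    level_of = []
    for _ in range(num_coords):
        lvl = 0
        while lvl + 1 < levels and rng.random() < 0.5:
            lvl += 1
        level_of.append(lvl)                  # coordinate is kept at levels 0..lvl
    return r, level_of, levels

def sketch(vec, params):
    """Linear map from a sparse vector {coordinate: value} to 3 counters per level."""
    r, level_of, levels = params
    s = [[0, 0, 0] for _ in range(levels)]
    for i, val in vec.items():
        for lvl in range(level_of[i] + 1):
            s[lvl][0] += val                                   # sum of values
            s[lvl][1] += val * i                               # index-weighted sum
            s[lvl][2] = (s[lvl][2] + val * pow(r, i, P)) % P   # fingerprint
    return s

def add(s, t):
    """Sum of two sketches built with the same parameters."""
    return [[a[0] + b[0], a[1] + b[1], (a[2] + b[2]) % P] for a, b in zip(s, t)]

def sample(s, params):
    """Return a coordinate that passes the 1-sparse test at some level, else None."""
    r, level_of, _ = params
    for s0, s1, s2 in s:
        if s0 in (1, -1):
            i = s1 * s0                       # equals s1 / s0 because s0 is +1 or -1
            if 0 <= i < len(level_of) and s2 == (s0 * pow(r, i, P)) % P:
                return i
    return None

# Toy graph on vertices 0..5; the part {0, 1, 2} has the single outgoing edge (2, 3).
n = 6
pairs = list(combinations(range(n), 2))
coord = {pair: i for i, pair in enumerate(pairs)}
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5)]

def incidence(u):
    vec = {}
    for x, y in edges:                        # every edge is stored with x < y
        if u == x:
            vec[coord[(x, y)]] = 1
        elif u == y:
            vec[coord[(x, y)]] = -1
    return vec

params = make_params(len(pairs), seed=7)
combined = sketch(incidence(0), params)
for u in (1, 2):
    combined = add(combined, sketch(incidence(u), params))
picked = sample(combined, params)
```

In this example the summed vector has a single nonzero coordinate, so level 0 already passes the 1-sparse test and `pairs[picked]` is the outgoing edge $(2, 3)$. The combined sketch is also identical to the sketch of the summed incidence vector, which is exactly the linearity used in Lemma 2.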


Constructing Linear Sketches Without Shared Randomness. Our construction of the linear
sketches described so far requires $\tilde{O}(n)$ fully independent random bits that would need to be shared
by all machines. It is shown in Theorem 1 (cf. also Corollary 1) of [10] that it is possible to construct
such an $\ell_0$-sampler (having the same linearity properties) by using $\tilde{O}(n)$ random bits that are only
$\Theta(\log n)$-wise independent. Analogously to Section 2.2, we can generate the required $O(\log^2 n)$
true random bits at machine $M_1$, distribute them among all other machines in $O(1)$ rounds, and then
invoke Theorem 2.1 of [5] at each machine in parallel to generate the required (shared) $\Theta(\log n)$-wise
independent random bits for constructing the sketches.


2.4 Outgoing Edge Selection 

Now that we know how to construct a sketch of the graph neighborhood of any set of vertices, we
will describe how to combine these sketches in a communication-efficient way in the $k$-machine
model. The goal of this step is, for each (current) component $C$, to find an outgoing edge that
connects $C$ to some other component $C'$.

Recall that $C$ itself might be split into parts $P_1, P_2, \ldots, P_j$ across multiple machines. Therefore,
as a first step, each machine $M_i$ locally constructs the combined sketch for each part that resides in
$M_i$. By Lemma 2, the resulting sketches have polylogarithmic size each and present a sketch of the
incidences of their respective component parts. Next, we combine the sketches of the individual
parts of each component $C$ to a sketch of $C$, by instructing the machines to send the sketch of
each part $P_i$ (of component $C$) to the proxy machine of $C$. By virtue of Lemma 1, all of these
messages are delivered to the component proxies within $\tilde{O}(n/k^2)$ rounds. Finally, the component
proxy machine of $C$ combines the received sketches to yield a sketch of $C$, and randomly samples an
outgoing edge of $C$ (see Lemma 2). Thus, at the end of this procedure, every component has (randomly)
selected exactly one neighboring component. We now show that the complexity of this procedure is
$\tilde{O}(n/k^2)$ w.h.p.

8 Here we describe the construction as if nodes have access to a source of shared randomness (to create the sketch
matrix). We later show how to remove this assumption.

Lemma 3. Every component can select exactly one outgoing edge in $\tilde{O}(n/k^2)$ rounds with high
probability.

Proof. Clearly, since at every moment each node has a unique component label, each machine
holds $\tilde{O}(n/k)$ component parts w.h.p. Each of these parts selected at most one edge, and thus
each machine "selected" $\tilde{O}(n/k)$ edges w.h.p. All these edges have to be sent to the corresponding
proxy. By Lemma 1, this requires $\tilde{O}(n/k^2)$ rounds.

The procedure is completed when the proxies communicate the decision to each of the at most $k$
parts of each component. This entails as many messages as in the first part, routed over exactly the
same links used in the first part, with the only difference being that messages now travel
in the opposite direction. The lemma follows. □

2.5 Merging of Components 

After the proxy machine of each component C has selected one edge connecting C to a different 
component, all the neighboring components are merged so as to become a new, bigger component. 
This is accomplished by relabeling the nodes of the graph such that all the nodes in the same (new) 
component have the same label. Notice that the merging is thus only virtual, that is, component 
parts that compose a new component are not moved to a common machine; rather, nodes (and their 
incident edges) remain in their home machine, and just get (possibly) assigned a new label. 

We can think of the components along with the sampled outgoing edges as a component graph
$\mathcal{C}$. We use the distributed random ranking (DRR) technique [8] to avoid having long chains of
components (i.e., long paths in $\mathcal{C}$). That is, we will (conceptually) construct a forest of directed
trees that is a subgraph (modulo edge directions) of the component graph $\mathcal{C}$ and where each tree
has depth $O(\log n)$ (see footnote 9). The component proxy of each component $C$ chooses a rank independently
and uniformly at random from $[0,1]$. (It is easy to show that $O(\log n)$ bits provide sufficient accuracy to
break ties w.h.p.) Now, the proxy machine of $C$ (virtually) connects $C$ to its neighboring component
$C'$ if and only if the rank chosen by the latter's proxy is higher. In this case, we say that $C'$ becomes
the parent of $C$ and $C$ is a child of $C'$.

Lemma 4. After $\tilde{O}(n/k^2)$ rounds, the structure of the DRR-tree is completed with high probability.

Proof. We need to show that every proxy machine of a non-root component knows its higher-ranking
parent component and every root proxy machine knows that it is a root. Note that during this step
the proxy machines of the child components communicate with the respective parent proxy machines.
Moreover, the number of messages sent for determining the ordering of the DRR-trees is guaranteed
to be $O(n)$ with high probability, since $\mathcal{C}$ has only $O(n)$ edges. By instantiating Lemma 1, it follows
that the delivery of these messages can be completed in $\tilde{O}(n/k^2)$ rounds w.h.p.

Since links are bidirectional, the parent proxies are able to send their replies within the same
number of rounds, by re-running the message schedule of the child-to-parent communication in
reverse order. □

9 Instead of using DRR trees, an alternate and simpler idea is the following. Let every component select a number
in $\{0,1\}$. A merging can be done only if the outgoing edge (obtained from the sketch) connects a component with ID 0
to a component with ID 1. One can show that this merging procedure also gives the same time bound.

If a component has the highest rank among all its neighbors (in C), we call it a root component. 
Since every component except root components connects to a component with higher rank, the 
resulting structure is a set of disjoint rooted trees. 
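The following Python sketch simulates the ranking step centrally, only to illustrate why the parent pointers form a forest. The function names `drr_forest` and `depth` and the chain-shaped example are hypothetical; in the actual algorithm the ranks are chosen and compared by the component proxies.

```python
import random

def drr_forest(chosen_neighbor, rng):
    """chosen_neighbor[c] is the component sampled by c. Returns (rank, parent),
    where parent[c] is the sampled neighbor if it has a higher rank, else None."""
    rank = {c: rng.random() for c in chosen_neighbor}
    parent = {}
    for c, nb in chosen_neighbor.items():
        parent[c] = nb if rank[nb] > rank[c] else None
    return rank, parent

def depth(c, parent):
    """Number of parent pointers followed from c until a root is reached."""
    d = 0
    while parent[c] is not None:
        c = parent[c]
        d += 1
    return d

# Worst-case shape for naive merging: a long chain in which c sampled c + 1.
rng = random.Random(1)
n_components = 1000
chosen = {c: c + 1 for c in range(n_components - 1)}
chosen[n_components - 1] = n_components - 2
rank, parent = drr_forest(chosen, rng)
max_depth = max(depth(c, parent) for c in chosen)
```

Ranks strictly increase along every parent pointer, so the pointers cannot form a cycle and each walk in `depth` ends at a root. On such a chain, `max_depth` is the length of the longest run of increasing ranks, which is typically far smaller than the chain length.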

In the next step, we will merge all components of each tree into a single new component such
that all vertices that are part of some component in this tree receive the label of the root. Consider
a tree $T$. We proceed level-wise (in parallel for all trees) and start the merging of components at
the leaves, each of which is connected to a (higher-ranking) parent component $C'$.

Lemma 5. There is a distributed algorithm that merges all trees of the DRR forest in $\tilde{O}(dn/k^2)$
rounds with high probability, where $d$ is the largest depth of any tree.

Proof. We proceed in $d$ iterations by merging the (current) leaf components with their parents in
the tree. Thus it is sufficient to analyze the time complexity of a single iteration. To this end, we
describe a procedure that changes the component labels of all vertices that are in leaf components
in the DRR forest to the label of the respective parent in $\tilde{O}(n/k^2)$ rounds.

At the beginning of each iteration, we select a new proxy for each component $C$ by querying the
shared hash function $h_{j,p}(C)$, where $p$ is the current iteration number. This ensures that there are
no dependencies between the proxies used in each iteration. We know from Lemma 4 that there is
a message schedule such that leaf proxies can communicate with their respective parent proxy in
$\tilde{O}(n/k^2)$ rounds (w.h.p.) and vice versa, and thus every leaf proxy knows the component label of its
parent. We have already shown in Lemma 3 that we can deliver a message from each component
part to its respective proxy (when combining the sketches) in $\tilde{O}(n/k^2)$ rounds. Hence, by re-running
this message schedule, we can broadcast the parent label from the leaf proxy to each component
part in the same time. Each machine that receives the parent label locally changes the component
label of the vertices that are in the corresponding part. □
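The level-wise relabeling can be illustrated with the following centralized Python sketch. It only mirrors the order in which labels change; the function name `merge_trees` and the small example forest are assumptions for illustration, and no message passing is modeled.

```python
def merge_trees(parent, label):
    """parent: component -> parent component (None for roots).
    label: vertex -> component. Repeatedly relabels the vertices of every current
    leaf component with the label of its parent and removes that leaf."""
    parent = dict(parent)
    label = dict(label)
    iterations = 0
    while any(p is not None for p in parent.values()):
        has_child = {p for p in parent.values() if p is not None}
        leaves = [c for c, p in parent.items() if p is not None and c not in has_child]
        relabel = {c: parent[c] for c in leaves}
        label = {v: relabel.get(c, c) for v, c in label.items()}
        for c in leaves:
            del parent[c]
        iterations += 1
    return label, iterations

# One tree of depth 2: C -> B -> A and D -> A, with A as the root.
parent = {"A": None, "B": "A", "C": "B", "D": "A"}
label = {1: "A", 2: "B", 3: "C", 4: "C", 5: "D"}
final_label, iterations = merge_trees(parent, label)
```

Here the first iteration merges the leaves C and D into their parents, and the second merges B (which now also holds the vertices of C) into the root A, so the number of iterations equals the depth of the tree.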

The following result is proved in [8, Theorem 11]. To keep the paper self-contained we also
provide a direct and simpler proof for this result (see Appendix).

Lemma 6 ([8, Theorem 11]). The depth of each DRR tree is $O(\log n)$ with high probability.

2.6 Analysis of the Time Complexity 

We now show that the number of phases required by the algorithm to determine the connected
components of the input graph is $O(\log n)$. At the beginning of each phase $i$, distributed across
the $k$ machines there are $c_i$ distinct components. At the beginning of the algorithm each node is
identified as a component, and thus $c_0 = n$. The algorithm ends at the completion of phase $\varphi$,
where $\varphi$ is the smallest integer such that $c_\varphi = cc(G)$, where $cc(G)$ denotes the number of connected
components of the input graph $G$. If pairs of components were merged in each phase, it would be
straightforward to show that the process would terminate in at most $O(\log n)$ phases. However,
in our algorithm each component connects to its neighboring component if and only if the latter
has a higher rank. Nevertheless, it is not difficult to show that this slightly different process also
terminates in $O(\log n)$ phases w.h.p. (that is, components get merged "often enough"). The
intuition for this result is that, since components' ranks are taken randomly, for each component the
probability that its neighboring component has a higher rank is exactly one half. Hence, on average
half of the components will not be merged with their own neighbor: each of these components thus
becomes a root of one component, which means that, on average, the number of new components
will be half as well.

Lemma 7. After 12 log n phases, the component labels of the vertices correspond to the connected 
components of G with high probability. 

Proof. Replace the $c_i$'s with corresponding random variables $C_i$'s, and consider the stochastic
process defined by the sequence $C_0, C_1, \ldots, C_\varphi$. Let $\tilde{C}_i$ be the random variable that counts the
number of components that actually participate in the merging process of phase $i$, because they
do have an outgoing edge to another component. Call these components participating components.
Clearly, by definition, $\tilde{C}_i \le C_i$.

We now show that, for every phase $i \in [\varphi - 1]$, $E[E[\tilde{C}_{i+1} \mid \tilde{C}_i]] \le E[\tilde{C}_i]/2$. To this end, fix a
generic phase $i$ and a random ordering of its $\tilde{C}_i$ participating components. Define random variables
$X_{i,1}, X_{i,2}, \ldots, X_{i,\tilde{C}_i}$, where $X_{i,j}$ takes value 1 if the $j$-th participating component will be a root of
a participating tree/component for phase $i+1$, and 0 otherwise. Then, $\tilde{C}_{i+1} \mid \tilde{C}_i = \sum_{j=1}^{\tilde{C}_i} X_{i,j}$ is
the number of participating components for phase $i+1$. As we noticed before, for any $i \in [\varphi - 1]$
and $j \in [\tilde{C}_i]$, the probability that a participating component will not be merged to its neighboring
component, and thus become a root of a tree/component for phase $i+1$, is exactly one half. Therefore,

$$\Pr(X_{i,j} = 1) \le 1/2.$$

Hence, by the linearity of expectation, we have that

$$E[\tilde{C}_{i+1} \mid \tilde{C}_i] = \sum_{j=1}^{\tilde{C}_i} E[X_{i,j}] = \sum_{j=1}^{\tilde{C}_i} \Pr(X_{i,j} = 1) \le \frac{\tilde{C}_i}{2}.$$

Then, using again the linearity of expectation,

$$E[E[\tilde{C}_{i+1} \mid \tilde{C}_i]] \le E\left[\frac{\tilde{C}_i}{2}\right] = \frac{E[\tilde{C}_i]}{2}.$$

We now leverage this result to prove the claimed statement. Let us call a phase successful if it
reduces the number of participating components by a factor of at most 3/4. By Markov's inequality,
the probability that phase $i$ is not successful is

$$\Pr\left(E[\tilde{C}_{i+1} \mid \tilde{C}_i] > \frac{3}{4} E[\tilde{C}_i]\right) < \frac{E[E[\tilde{C}_{i+1} \mid \tilde{C}_i]]}{(3/4)\,E[\tilde{C}_i]} \le \frac{E[\tilde{C}_i]}{2} \cdot \frac{4}{3\,E[\tilde{C}_i]} = \frac{2}{3},$$

and thus the probability that a phase of the algorithm is successful is at least 1/3. Now consider
a sequence of $12 \log n$ phases of the algorithm. We shall prove that within that many phases the
algorithm w.h.p. has reduced the number of participating components a sufficient number of times so
that the algorithm has terminated, that is, $\varphi \le 12 \log n$ w.h.p. Let $X_i$ be an indicator variable that
takes value 1 if phase $i$ is successful, and 0 otherwise (this also includes the case that the $i$-th phase
does not take place because the algorithm already terminated). Let $X = \sum_{i=1}^{12 \log n} X_i$ be the number
of successful phases out of the at most $12 \log n$ phases of the algorithm. Since $\Pr(X_i = 1) \ge 1/3$, by
the linearity of expectation we have that

$$E[X] = \sum_{i=1}^{12 \log n} E[X_i] = \sum_{i=1}^{12 \log n} \Pr(X_i = 1) \ge \frac{12 \log n}{3} = 4 \log n.$$

As the $X_i$'s are independent we can apply a standard Chernoff bound, which gives

$$\Pr(X \le \log n) \le e^{-4 \log n\,(3/4)^2/2} = e^{-\frac{9}{8} \log n} < \frac{1}{n}.$$

Hence, with high probability $12 \log n$ phases are enough to determine all the components of the
input graph. □
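The phase structure analyzed above can be explored with the following centralized Python simulation. It is a sketch under simplifying assumptions: uniform sampling of an outgoing edge replaces the sketch query of Lemma 2, and all trees are collapsed to their roots in one step instead of level by level. The function name `components_by_phases` and the example graph are hypothetical.

```python
import random

def components_by_phases(n, edges, rng):
    """Returns (label, phases): final component label per vertex and the number of
    phases executed until no component has an outgoing edge."""
    label = list(range(n))                    # every vertex starts as its own component
    phases = 0
    while True:
        outgoing = {}
        for u, v in edges:
            if label[u] != label[v]:
                outgoing.setdefault(label[u], []).append(label[v])
                outgoing.setdefault(label[v], []).append(label[u])
        if not outgoing:
            return label, phases
        rank = {c: rng.random() for c in set(label)}
        parent = {}
        for c, neighbors in outgoing.items():
            target = rng.choice(neighbors)    # one uniformly random outgoing edge
            if rank[target] > rank[c]:        # connect only to a higher-ranked neighbor
                parent[c] = target

        def root(c):
            while c in parent:
                c = parent[c]
            return c

        label = [root(c) for c in label]      # all vertices of a tree get the root label
        phases += 1

# Two disjoint paths on 100 vertices each.
edges = [(i, i + 1) for i in range(99)] + [(i, i + 1) for i in range(100, 199)]
label, phases = components_by_phases(200, edges, random.Random(3))
```

The simulation stops exactly when the labels coincide with the connected components of the input graph. Printing `phases` for different seeds and graph sizes gives an empirical feel for the logarithmic bound of Lemma 7.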

Theorem 1. There is a distributed algorithm in the $k$-machine model that determines the connected
components of a graph $G$ in $\tilde{O}(n/k^2)$ rounds with high probability.

Proof. By Lemma 7, the algorithm finishes in $O(\log n)$ phases with high probability. To analyze the
time complexity of an individual phase, recall that it takes $\tilde{O}(n/k^2)$ rounds to sample an outgoing
edge (see Lemma 3). Then, building the DRR forest requires $\tilde{O}(n/k^2)$ additional rounds, according
to Lemma 4. Merging each DRR tree $T$ in a level-wise fashion (in parallel) takes $\tilde{O}(dn/k^2)$ rounds
(see Lemma 5), where $d$ is the depth of $T$ which, by virtue of Lemma 6, is bounded by $O(\log n)$.
Since each of these time bounds holds with high probability, and the algorithm consists of $O(\log n)$
phases with high probability, by the union bound we conclude that the total time complexity of the
algorithm is $\tilde{O}(n/k^2)$ with high probability. □

We conclude the section by noticing that it is easy to output the actual number of connected
components after the termination of our algorithm: every machine just needs to send "YES" directly
to the proxy of each component label it holds, and subsequently such proxies will send
the labels of the components for which they received "YES" to one predetermined machine. Since
the communication is performed via the components' proxies, it follows from Lemma 1 that the first
step takes $\tilde{O}(n/k^2)$ rounds w.h.p., and the second step takes only $O(\log n)$ rounds w.h.p.

3 Applications 

In this section we describe how to use our fast connectivity algorithm as a building block to solve
several other fundamental graph problems in the $k$-machine model in time $\tilde{O}(n/k^2)$.

3.1 Constructing a Minimum Spanning Tree 

Given a weighted graph where each edge $e = (u, v)$ has an associated weight $w(e)$, initially known to
both the home machines of $u$ and $v$, the minimum spanning tree (MST) problem asks to output a
set of edges that form a tree, connect all nodes, and have the minimum possible total weight. Klauck
et al. [22] show that $\tilde{\Omega}(n/k)$ rounds are necessary for constructing any spanning tree (ST), assuming
that, for every spanning tree edge $e = (u,v)$, the home machine of $u$ and the home machine of $v$
must both output $(u, v)$ as being part of the ST. Here we show that we can break the $\tilde{\Omega}(n/k)$ barrier,
under the slightly less stringent requirement that each spanning tree edge $e = (u, v)$ is returned by
at least one machine, but not necessarily by both the home machines of $u$ and $v$.

Our algorithm mimics the multi-pass MST construction procedure of [2], originally devised for
the (centralized) streaming model. To this end we modify our connectivity procedure of Section 2
by ensuring that when the proxy of a component $C$ chooses an outgoing edge $e$, this is the minimum
weight outgoing edge (MWOE) of $C$ with high probability.

We now describe the $i$-th phase of this MST construction in more detail. Analogously to the
connectivity algorithm in Section 2, the proxy of each component $C$ determines an outgoing edge
$e_0$ which, by the guarantees of our sketch construction (Lemma 2), is chosen uniformly at random
from all possible outgoing edges of $C$.

We then repeat the following edge-elimination process $t = \Theta(\log n)$ times: The proxy broadcasts
$w(e_0)$ to every component part of $C$. Recall from Lemma 3 that this communication is possible in
$\tilde{O}(n/k^2)$ rounds. Upon receiving this message, the machine $M$ of a part $P$ of $C$ constructs a new
sketch $s_u$ for each $u \in P$, but first zeroes out all entries in $a_u$ that refer to edges of weight $> w(e_0)$.
(See Section 2.3 for a more detailed description of $a_u$ and $s_u$.) Again, we combine the sketches of
all vertices of all parts of $C$ at the proxy of $C$, which in turn samples a new outgoing edge $e_1$ for
$C$. Since each time we sample a randomly chosen edge and eliminate all higher weight edges, it is
easy to see that the edge $e_t$ is the MWOE of $C$ w.h.p. Thus, the proxy machine of $C$ includes the
edge $e_t$ as part of the MST output. Note that this additional elimination procedure incurs only a
logarithmic time complexity overhead.
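The elimination process can be illustrated by the following Python sketch, in which uniform sampling from an explicit list stands in for querying the combined sketch. The function name `mwoe_by_elimination` and the example weights are assumptions made for illustration.

```python
import random

def mwoe_by_elimination(outgoing, t, rng):
    """outgoing: list of (weight, edge) pairs of one component.
    Returns the sampled edges e_0, e_1, ..., e_t in order."""
    candidates = list(outgoing)
    current = rng.choice(candidates)          # e_0: uniform over all outgoing edges
    history = [current]
    for _ in range(t):
        # Zero out every edge that is strictly heavier than the last sample.
        candidates = [e for e in candidates if e[0] <= current[0]]
        current = rng.choice(candidates)      # next sample among the survivors
        history.append(current)
    return history

rng = random.Random(5)
outgoing = [(w, ("u%d" % w, "v%d" % w)) for w in range(1, 65)]  # distinct weights 1..64
history = mwoe_by_elimination(outgoing, t=24, rng=rng)
weights = [w for w, _ in history]
```

The sampled weights never increase, and each repetition discards in expectation about half of the remaining candidates, which is why a logarithmic number of repetitions suffices with high probability.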

At the end of each phase, we proceed by (virtually) merging the components along their MWOEs
in a similar manner as for the connectivity algorithm (see Section 2.5), thus requiring $\tilde{O}(n/k^2)$
rounds in total.

Let E be the set of added outgoing edges. Since the components of the connectivity algorithm 
eventually match the actual components of the input graph, the graph H on the vertices V(G) 
induced by E connects all vertices of G. Moreover, since components are merged according to the 
trees of the DRR-process (see Section 2.5), it follows that H is cycle-free. 

We can now fully classify the complexity of the MST problem in the $k$-machine model:

Theorem 2. There exists an algorithm for the $k$-machine model that outputs an MST in

(a) $\tilde{O}(n/k^2)$ rounds, if each MST-edge is output by at least one machine, or in

(b) $\tilde{O}(n/k)$ rounds, if each MST-edge $e$ is output by both machines that hold an endpoint of $e$.

Both bounds are tight up to polylogarithmic factors.


3.2 O(log n)-Approximation for Min-Cut

Here we show the following result for the min-cut problem in the $k$-machine model.

Theorem 3. There exists an $O(\log n)$-approximation algorithm for the min-cut problem in the
$k$-machine model that runs in $\tilde{O}(n/k^2)$ rounds with high probability.

Proof. We use exponentially growing sampling probabilities for sampling edges and then check
connectivity, leveraging a result by Karger [18]. This procedure was proposed in [15] in the classic
CONGEST model, and can be implemented in the $k$-machine model as well, where we use our fast
connectivity algorithm (in place of Thurimella's algorithm [32] used in [15]). The time complexity
is dominated by the connectivity-testing procedure, and thus is $\tilde{O}(n/k^2)$ w.h.p. □

3.3 Algorithms for Graph Verification Problems 

It is well known that graph connectivity is an important building block for several graph verification
problems (see, e.g., HU). We now analyze some of these problems, formally defined, e.g., in Section 2.4
of sm, in the $k$-machine model.

Theorem 4. There exist algorithms for the $k$-machine model that solve the following verification
problems in $\tilde{O}(n/k^2)$ rounds with high probability: spanning connected subgraph, cycle containment,
e-cycle containment, cut, s-t connectivity, edge on all paths, s-t cut, bipartiteness.

Proof. We discuss each problem separately.

Cut verification: remove the edges of the given cut from $G$, and then check whether the resulting
graph is connected.

s-t connectivity verification: run the connectivity algorithm and then verify whether $s$ and $t$
are in the same connected component by checking whether they have the same label.

Edge on all paths verification: since $e$ lies on all paths between $u$ and $v$ iff $u$ and $v$ are
disconnected in $G \setminus \{e\}$, we can simply use the s-t connectivity verification algorithm from the
previous item.

s-t cut verification: to verify if a subgraph is an s-t cut, simply verify s-t connectivity of the
graph after removing the edges of the subgraph.

Bipartiteness verification: use the connectivity algorithm and the reduction presented in
Section 3.3 of [2].

Spanning connected subgraph, cycle containment, and e-cycle containment verification:
these also follow from the reductions given in HU. □
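Several of these reductions use the connectivity algorithm as a black box. The following Python sketch expresses three of them on top of a sequential union-find routine, which here merely stands in for the distributed algorithm of Theorem 1. All function names and the example graph are hypothetical.

```python
def component_labels(n, edges):
    """Sequential stand-in for the connectivity algorithm: one label per vertex."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return [find(x) for x in range(n)]

def without(edges, removed):
    """Edges of the graph after deleting the given (undirected) edges."""
    removed = {frozenset(e) for e in removed}
    return [e for e in edges if frozenset(e) not in removed]

def st_connected(n, edges, s, t):
    lab = component_labels(n, edges)
    return lab[s] == lab[t]                   # same label means same component

def is_cut(n, edges, cut_edges):
    # The given edges form a cut iff the remaining graph is not connected.
    return len(set(component_labels(n, without(edges, cut_edges)))) > 1

def edge_on_all_paths(n, edges, e, u, v):
    # e lies on all u-v paths iff u and v are disconnected once e is removed.
    return not st_connected(n, without(edges, [e]), u, v)

# Triangle on {0, 1, 2} with a pendant edge (2, 3).
n, edges = 4, [(0, 1), (1, 2), (0, 2), (2, 3)]
results = (
    is_cut(n, edges, [(2, 3)]),
    is_cut(n, edges, [(0, 1)]),
    edge_on_all_paths(n, edges, (2, 3), 0, 3),
    edge_on_all_paths(n, edges, (0, 1), 0, 1),
)
```

In the example, removing $(2, 3)$ isolates vertex 3, whereas removing $(0, 1)$ leaves the path through vertex 2, so `results` is `(True, False, True, False)`.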

4 Lower Bounds for Verification Problems 

In this section we show that $\tilde{\Omega}(n/k^2)$ rounds is a fundamental lower bound for many graph verification
problems in the $k$-machine model. To this end we will use results from the classical theory of
communication complexity [23], a popular way to derive lower bounds in distributed message-passing
models HU, [36], [37].

Even though many verification problems are known to satisfy a lower bound of $\tilde{\Omega}(D + \sqrt{n})$ in
the classic distributed CONGEST model HU, the reduction of HU encodes a $\Theta(\sqrt{n})$-instance of set
disjointness, requiring at least one node to receive $\Theta(\sqrt{n})$ information across a single short "highway"
path or via $\Theta(\sqrt{n})$ longer paths of length $\Theta(\sqrt{n})$. Moreover, we assume the random vertex partition
model, whereas the results of m assume a worst case distribution. Lastly, any pair of machines can
communicate directly in the $k$-machine model, thus breaking the $\tilde{\Omega}(D)$ bound for the CONGEST
model.

Our complexity bounds follow from the communication complexity of 2-player set disjointness
in the random input partition model (see [22]). While in the standard model of communication
complexity there are 2 players, Alice and Bob, and Alice (resp., Bob) receives an input vector $X$
(resp., $Y$) of $b$ bits [23], in the random input partition model Alice receives $X$ and, in addition, each
bit of $Y$ has probability 1/2 to be revealed to Alice. Bob's input is defined similarly with respect to
$X$. In the set disjointness problem, Alice and Bob must output 1 if and only if there is no index $i$
such that $X[i] = Y[i] = 1$. The following result holds.

Lemma 8 ([22, Lemma 3.2]). For some constant $\epsilon > 0$, every randomized communication protocol
that solves set disjointness in the random input partition model of 2-party communication complexity
with probability at least $1 - \epsilon$ requires $\Omega(b)$ bits.

Now we can show the main result of this section. 

Theorem 5. There exists a constant $\gamma > 0$ such that any $\gamma$-error algorithm $A$ has round complexity
of $\tilde{\Omega}(n/k^2)$ on an $n$-node graph of diameter 2 in the $k$-machine model, if $A$ solves any
of the following problems: connectivity, spanning connected subgraph, cycle containment, e-cycle
containment, s-t-connectivity, cut, edge on all paths, and s-t-cut.

Proof. The high-level idea of the proof is similar to the simulation theorem of HD. We present the
argument for the spanning connected subgraph problem defined below. The remaining problems
can be reduced to the SCS problem using reductions similar to those in la.

In the spanning connected subgraph (SCS) problem we are given a graph $G$ and a subgraph
$H \subseteq G$ and we want to verify whether $H$ spans $G$ and is connected. We will show, through a
reduction from 2-party set disjointness, that any algorithm for SCS in the $k$-machine model requires
$\tilde{\Omega}(n/k^2)$ rounds.

Given an instance of the 2-party set disjointness problem in the random partition model we will
construct the following input graphs $G$ and $H$. The nodes of $G$ consist of 2 special nodes $s$ and
$t$, and nodes $u_1, \ldots, u_b, v_1, \ldots, v_b$, for $b = (n-2)/2$. (For clarity of presentation, we assume that
$(n-2)/2$ and $k/2$ are integers.) The edges of $G$ consist of the edges $(s, t)$, $(u_i, v_i)$, $(s, u_i)$, $(v_i, t)$, for
$1 \le i \le b$.

Let $M_A$ be the set of machines simulated by Alice, and let $M_B$ be the set of machines simulated
by Bob, where $|M_A| = |M_B| = k/2$. First, Alice and Bob use shared randomness to choose the
machines $M_X$ and $M_Y$ that receive the vertices $s$ and $t$. If $M_X \ne M_Y$, then Alice assigns $t$ to a
machine chosen randomly from $M_A$, and Bob assigns $s$ to a random machine in $M_B$. Otherwise, if
$M_X$ and $M_Y$ denote the same machine, Alice and Bob output 0 and terminate the simulation.

The subgraph $H$ is determined by the disjointness input vectors $X$ and $Y$ as follows: $H$ contains 
all nodes of $G$ and the edges $(u_i, v_i)$, $(s,t)$, $1 \le i \le b$. Recall that, in the random partition model, 
$X$ and $Y$ are randomly distributed between Alice and Bob, but Alice knows all of $X$ and Bob knows 
all of $Y$. Hence, Alice and Bob mark the corresponding edges as being part of $H$ according to their 
respective input bits. That is, if Alice received $X[i]$ (i.e., Bob did not receive $X[i]$), she assigns 
the node $u_i$ to a random machine in $M_A$ and adds the edge $(s, u_i)$ to $H$ if and only if $X[i] = 0$. 
Similarly, the edge $(v_i, t)$ is added to $H$ if and only if $Y[i] = 0$ (by either Alice or Bob depending on 
who receives $Y[i]$). See Figure 1. Note that, since $X$ and $Y$ were assigned according to the random 
input partition model, the resulting distribution of vertices to machines adheres to the random 
vertex partition model. Clearly, $H$ is an SCS if and only if $X$ and $Y$ are disjoint. 
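
The construction can be checked mechanically on small instances. The following Python sketch is only an illustration under simplifying assumptions: it ignores the assignment of vertices to machines and builds $G$ and $H$ centrally. The function names (`build_scs_instance`, `is_spanning_connected`, `disjoint`) are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def build_scs_instance(X, Y):
    """Build (nodes, edges_G, edges_H) from 0/1 vectors X and Y of length b.

    Illustrative only: vertices are plain strings and no machine
    assignment is modeled.
    """
    b = len(X)
    assert len(Y) == b
    nodes = ["s", "t"]
    nodes += [f"u{i}" for i in range(1, b + 1)]
    nodes += [f"v{i}" for i in range(1, b + 1)]
    edges_G = [("s", "t")]
    edges_H = [("s", "t")]            # (s, t) is always part of H
    for i in range(1, b + 1):
        u, v = f"u{i}", f"v{i}"
        edges_G += [(u, v), ("s", u), (v, "t")]
        edges_H.append((u, v))        # (u_i, v_i) is always part of H
        if X[i - 1] == 0:
            edges_H.append(("s", u))  # added iff X[i] = 0
        if Y[i - 1] == 0:
            edges_H.append((v, "t"))  # added iff Y[i] = 0
    return nodes, edges_G, edges_H


def is_spanning_connected(nodes, edges):
    """Return True iff the edge set connects every node in `nodes`."""
    adj = defaultdict(list)
    for a, c in edges:
        adj[a].append(c)
        adj[c].append(a)
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:                      # depth-first search from the first node
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(nodes)


def disjoint(X, Y):
    """Return True iff no index i has X[i] = Y[i] = 1."""
    return all(not (x and y) for x, y in zip(X, Y))
```

Running `is_spanning_connected` on the edges of $H$ and comparing the result with `disjoint(X, Y)` over all pairs of short bit vectors exercises the equivalence claimed above: a pair $u_i, v_i$ is cut off from $s$ and $t$ exactly when $X[i] = Y[i] = 1$.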

Figure 1: The graph construction for the spanning connected subgraph problem, given a set 
disjointness instance where $X[1] = 0$, $Y[1] = 1$, $X[i] = 1$, $Y[i] = 0$, and $X[b] = Y[b] = 0$. The thick 
edges are the edges of subgraph $H$. The subgraph $H$ contains all edges $(u_i, v_i)$ ($1 \le i \le b$) and 
$(s,t)$; the remaining edges of $H$ are determined by the input vectors $X$ and $Y$ of the set disjointness 
instance. 

We describe the simulation from Alice’s point of view (the simulation for Bob is similar): Alice 
locally maintains a counter $r_A$, initialized to 1, that represents the current round number. Then, 
she simulates the run of $A$ on each of her $k/2$ machines, yielding a set of $\ell$ messages $m_1, \dots, m_\ell$ 
of $O(\operatorname{polylog}(n))$ bits each that need to be sent to Bob to simulate the algorithm on his machines 
in the next round. By construction, we have that $0 \le \ell \le \lceil k^2/4 \rceil$. To send these messages in 
the (asynchronous) 2-party random partition model of communication complexity, Alice sends a 
message $\langle \ell, (M_1, m_1, M_2), \dots, (M_\ell, m_\ell, M_{\ell+1}) \rangle$ to Bob, where a tuple $(M_i, m_i, M_{i+1})$ corresponds to 
a message $m_i$ generated by machine $M_i$ simulated by Alice and destined to machine $M_{i+1}$ simulated 
by Bob. Upon receiving this message, Bob increments his own round counter and then locally simulates 
the next round of his machines by delivering the messages to the appropriate machines. Adding 
the source and destination fields to each message incurs an overhead of only $O(\log k) = O(\log n)$ 
bits, hence the total communication generated by simulating a single round of $A$ is upper bounded 
by $\tilde{O}(k^2)$. Therefore, if $A$ takes $T$ rounds to solve SCS in the $k$-machine model, then this gives us 
an $O(T k^2 \operatorname{polylog}(n))$-bit communication complexity protocol for set disjointness in the random 
partition model, as the communication between Alice and Bob is determined by the communication 
across the $\Theta(k^2)$ links required for the simulation, each of which can carry $O(\operatorname{polylog}(n))$ bits per 
round. Note that if $A$ errs with probability at most $\gamma$, then the simulation errs with probability 
at most $\gamma + 1/k$, where the extra $1/k$ term comes from the possibility that machines $M_X$ and $M_Y$ 
refer to the same machine. For large enough $k$ and small enough $\gamma$ we have $\gamma + 1/k < \varepsilon$. It follows 
that we need to simulate at least $T = \tilde{\Omega}(n/k^2)$ many rounds, since by Lemma 8 the set disjointness 
problem requires $\Omega(b)$ bits in the random partition model, when the error is smaller than $\varepsilon$. □ 
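
The final counting step can be written out as a short chain of inequalities. This is a sketch under stated assumptions rather than the paper's own display: $B$ stands for the $O(\operatorname{polylog}(n))$ message size, $c > 0$ for the constant hidden in the $\Omega(b)$ lower bound, and the factor 2 accounts for messages in both directions (Alice to Bob and Bob to Alice).

```latex
% Assumed symbols: B = O(polylog n) bits per message, c > 0 the constant
% in the Omega(b) set disjointness bound, T the round complexity of A.
\begin{align*}
c \cdot b
  &\le \text{(bits exchanged by Alice and Bob)} \\
  &\le T \cdot 2 \left\lceil \tfrac{k^2}{4} \right\rceil \cdot \bigl(B + O(\log k)\bigr)
   = O\bigl(T k^2 \operatorname{polylog}(n)\bigr), \\
\text{hence} \quad
T &= \Omega\!\left(\frac{b}{k^2 \operatorname{polylog}(n)}\right)
   = \tilde{\Omega}\!\left(\frac{n}{k^2}\right),
\qquad \text{since } b = \frac{n-2}{2}.
\end{align*}
```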

Interestingly, our lower bounds hold even for graphs of diameter 2, which is in contrast to the 
analogous results for the classic distributed CONGEST model assumed in Tl'i . We remark that 
the lower bound for connectivity verification was already shown in [22]. 
5 Conclusions 


There are several natural directions for future work. Our connectivity algorithm is randomized: it 
would be interesting to study the deterministic complexity of graph connectivity in the $k$-machine 
model. Specifically, does graph connectivity admit an $\tilde{O}(n/k^2)$-round deterministic algorithm? Investigating 
higher-order connectivity, such as 2-edge/vertex connectivity, is also an interesting research direction. 
A general question motivated by the algorithms presented in this paper is whether one can design 
algorithms that have superlinear scaling in $k$ for other fundamental graph problems. Some recent 
results in this direction are in [38]. 

Acknowledgments. The authors would like to thank Mohsen Ghaffari, Seth Gilbert, Andrew 
A Omitted Proofs 

A.1 Proof of Lemma 6 

Proof. Consider one phase of the algorithm, and suppose that during that phase there are $n$ 
components. (In one phase there are $c \le n$ components, thus setting $c = n$ gives a valid upper 
bound on the height of each DRR tree in that phase.) Each component picks a random rank from 
$[0,1]$. Thus, all ranks are distinct with high probability. If the target component’s rank is higher, 
then the source component connects to it, otherwise the source component becomes a root of a 
DRR tree. 

Consider an arbitrary component of the graph, and consider the (unique) path $P$ starting from 
the node that represents the component to the root of the tree that contains it. Let $|P|$ be the 
number of nodes of $P$, and assign indexes to the $|P|$ nodes of $P$ according to their position in the 
path from the selected node to the root. (See Figure 2.) 


Figure 2: One DRR tree, and one path from one node to the root of the tree. Nodes of the path are 
labeled with the indicator variable associated to them, indexed by the position of the node in the 
path. 


For each $i \in [|P|]$, define $X_i$ as the indicator variable that takes value 1 if node $i$ is not the root 
of $P$, and 0 otherwise. Then, $X = \sum_{i=1}^{|P|} X_i$ is the length of the path $P$. Because of the random 
choice of the outgoing edge made by the components’ parts, the outgoing edge of each component is 
to a random (and distinct) component. This means that, for each $j \le |P|$, the ranks of the first $j$ 
nodes of the path form a set of $j$ random values in $[0,1]$. Hence, the probability that a new random 
value in $[0,1]$ is higher than the rank of the $j$-th node of the path is the probability that the new 
random value is higher than all the $j$ previously chosen random values (that is, the probability that 
its value is the highest among all the first $j$ values of the path), and this probability is at most 
$1/(j+1)$. Thus, $\Pr(X_j = 1) \le 1/(j+1)$. Hence, by the linearity of expectation, the expected height 
of a path in a tree produced by the DRR procedure is 

$$\mathbb{E}[X] = \sum_{i=1}^{|P|} \mathbb{E}[X_i] = \sum_{i=1}^{|P|} \Pr(X_i = 1) \le \sum_{i=1}^{n} \frac{1}{i+1} \le \log(n+1).$$

Notice that the $X_i$’s are independent (but not identically distributed) random variables, since the 
probability that the $i$-th smallest ranked node is not a root depends only on the random neighbor 
that it picks, and is independent of the choices of the other nodes. Thus, applying a standard 
Chernoff bound (see, e.g., [33]) we have 

$$\Pr\bigl(X \ge 6\log(n+1)\bigr) \le 2^{-6\log(n+1)} = \frac{1}{(n+1)^6}.$$

Applying the union bound over all the at most n paths concludes the proof. □ 
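
The height bound can also be observed empirically. The Python sketch below is a simplified model, not the paper's DRR procedure: it assumes that each component picks its target uniformly at random among the other components, whereas in the algorithm the target is determined by a random outgoing edge. The function name `drr_max_depth` is hypothetical.

```python
import math
import random


def drr_max_depth(n, rng):
    """Simulate one ranking phase on n >= 2 components; return the max tree depth.

    Assumption (not from the paper): every component chooses its target
    uniformly at random among the other n - 1 components.
    """
    rank = [rng.random() for _ in range(n)]
    parent = [None] * n
    for c in range(n):
        target = rng.randrange(n - 1)
        if target >= c:
            target += 1               # skip c itself: uniform over the others
        if rank[target] > rank[c]:
            parent[c] = target        # connect to the higher-ranked target
        # otherwise c stays a root of its tree

    depth = [None] * n
    for c in range(n):
        path = []
        x = c
        # Walk up until we reach a root or a node whose depth is known.
        while depth[x] is None and parent[x] is not None:
            path.append(x)
            x = parent[x]
        if depth[x] is None:
            depth[x] = 0              # x is a root
        d = depth[x]
        for y in reversed(path):      # fill in depths on the way back down
            d += 1
            depth[y] = d
    return max(depth)
```

For instance, `drr_max_depth(1000, random.Random(0))` can be compared against `6 * math.log2(1001)`, the threshold used in the Chernoff bound. Since ranks strictly increase along every path to a root, the parent pointers can never form a cycle, which is why the upward walk always terminates.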